Dimensions of fractals related to languages defined by tagged 
Abstract

A representation of the frequency of strings of length $K$ in complete genomes of many organisms
in a square has led to seemingly self-similar patterns when $K$ increases. These patterns are
caused by under-represented strings with a certain "tag"-string and they define some fractals in
the $K \to \infty$ limit. The Box and Hausdorff dimensions of the limit set are discussed. Although
the method proposed by Mauldin and Williams to calculate Box and Hausdorff dimension is
valid in our case, a different and simpler method is proposed in this paper.
1 Introduction

In the past decade or so there has been a ground swell of interest in unraveling the mysteries
of DNA. The hereditary information of organisms (except for so-called RNA-viruses) is encoded in
their DNA sequence, which is a one-dimensional unbranched polymer made of four different kinds
of monomers (nucleotides): adenine (a), cytosine (c), guanine (g), and thymine (t). As far as the
encoded information is concerned we can ignore the fact that DNA exists as a double helix of two
"conjugated" strands and only treat it as a one-dimensional symbolic sequence made of the four
letters from the alphabet $\Sigma = \{a, c, g, t\}$. Since the first complete genome of a free-living bacterium,
Mycoplasma genitalium, was sequenced in 1995, an ever-growing number of complete genomes has
been deposited in public databases. The availability of complete genomes opens the possibility to
ask some global questions on these sequences. One of the simplest conceivable questions consists in
checking whether there are short strings of letters that are absent or under-represented in a complete
genome. The answer is in the affirmative and the fact may have some biological meaning.

The reason why we are interested in absent or under-represented strings is twofold. First of all, 
this is a question that can be asked only nowadays when complete genomes are at our disposal. 
Second, the question makes sense as one can derive a factorizable language from a complete genome 
which would be entirely defined by the set of forbidden words. 

We start by considering how to visualize the avoided and under-represented strings in a bacterial 
genome whose length is usually the order of a million letters. 

Bai-lin Hao et al. proposed a simple visualization method based on counting and
coarse-graining the frequency of appearance of strings of a given length. When applying the method to
all known complete genomes, fractal-like patterns emerge. The fractal dimensions are basic and
important quantities to characterize a fractal. One will naturally ask the question: what are the
fractal dimensions of the fractals related to languages defined by tagged strings? In this paper we
answer this question.

2 Graphical representation of counters 

We call any string made of $K$ letters from the set $\{g, c, a, t\}$ a $K$-string. For a given $K$ there are
in total $4^K$ different $K$-strings. In order to count the number of each kind of $K$-strings in a given
DNA sequence, $4^K$ counters are needed. These counters may be arranged as a $2^K \times 2^K$ square, as
shown in Fig. 1 for $K = 1$ to 3.

Figure 1: The arrangement of string counters for $K = 1$ to 3 in squares of the same size.



In fact, for a given $K$ the corresponding square may be represented as a direct product of $K$
copies of identical matrices:

$$M^{(K)} = M \otimes M \otimes \cdots \otimes M,$$

where each $M$ is a $2 \times 2$ matrix:

$$M = \begin{pmatrix} g & c \\ a & t \end{pmatrix},$$

which represents the $K = 1$ square in Fig. 1. For convenience of programming, we use binary digits
0 and 1 as subscripts for the matrix elements, i.e., let $M_{00} = g$, $M_{01} = c$, $M_{10} = a$, and $M_{11} = t$.
The subscripts of a general element of the $2^K \times 2^K$ direct product matrix $M^{(K)}$,

$$M^{(K)}_{I,J} = M_{i_1 j_1} M_{i_2 j_2} \cdots M_{i_K j_K},$$

are given by $I = i_1 i_2 \cdots i_K$ and $J = j_1 j_2 \cdots j_K$. These may be easily calculated from an input DNA
sequence

$$s_1 s_2 s_3 \cdots s_K s_{K+1} \cdots,$$

where $s_i \in \{g, c, a, t\}$. We call this $2^K \times 2^K$ square a $K$-frame. Put in a frame of fixed $K$ and
described by a color code biased towards small counts, each bacterial genome shows a distinctive
pattern which indicates absent or under-represented strings of certain types. For example,
many bacteria avoid strings containing the string ctag. Any string that contains ctag as a substring
will be called a ctag-tagged string. If we mark all ctag-tagged strings in frames of different $K$, we
get pictures as shown in Fig. 2. The large scale structure of these pictures persists but more details
appear with growing $K$. Excluding the area occupied by these tagged strings, one gets a fractal $F$
in the $K \to \infty$ limit. It is natural to ask what are the fractal dimensions of $F$ for a given tag.

Figure 2: ctag-tagged strings in $K = 6$ to 9 frames.
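
The indexing of a $K$-frame described above can be sketched in a few lines of Python. This is only an illustration under the stated convention $M_{00} = g$, $M_{01} = c$, $M_{10} = a$, $M_{11} = t$; the function names `frame_index`, `count_kstrings` and `tagged_cells` are hypothetical and are not taken from the authors' program.

```python
from collections import Counter
from itertools import product

# (row bit, column bit) of each letter, following M_00 = g, M_01 = c, M_10 = a, M_11 = t.
BITS = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}


def frame_index(kstring):
    """Return (I, J), the row and column of a K-string in the 2^K x 2^K frame."""
    row = col = 0
    for letter in kstring:
        i_bit, j_bit = BITS[letter]
        row = 2 * row + i_bit  # I = i_1 i_2 ... i_K read as a binary number
        col = 2 * col + j_bit  # J = j_1 j_2 ... j_K read as a binary number
    return row, col


def count_kstrings(sequence, k):
    """Slide a window of length k along the sequence and count each frame cell."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        counts[frame_index(sequence[start:start + k])] += 1
    return counts


def tagged_cells(k, tag):
    """Cells of the K-frame whose K-string contains `tag` as a substring."""
    words = ("".join(w) for w in product("gcat", repeat=k))
    return {frame_index(w) for w in words if tag in w}
```

For instance, `tagged_cells(6, "ctag")` lists the cells that would be marked in a $K = 6$ frame; removing them from the full square gives the finite-$K$ approximation of $F$.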

The dimension in question is that of the complementary set of the tagged strings. The simplest case
is that of $g$-tagged strings. As the pattern has an apparently self-similar structure, the dimension
is easily calculated to be

$$\dim_H(F) = \dim_B(F) = \frac{\log 3}{\log 2},$$

where $\dim_H(F)$ and $\dim_B(F)$ are the Hausdorff and Box dimensions of $F$.
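
As a quick check of the $g$-tag value, here is a short worked computation. It assumes the usual box-counting definition with squares of side $2^{-K}$; the symbols $a_K$ (number of squares kept in the $K$-frame) and $\delta_K$ (their side length) are introduced here only for illustration. A $K$-string avoids $g$ exactly when each of its letters is one of $c$, $a$, $t$, so $a_K = 3^K$ and $\delta_K = 2^{-K}$:

```latex
\dim_B(F) = \lim_{K\to\infty} \frac{\log a_K}{-\log \delta_K}
          = \lim_{K\to\infty} \frac{K \log 3}{K \log 2}
          = \frac{\log 3}{\log 2} \approx 1.585 .
```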

In formal language theory, we start with an alphabet $\Sigma = \{a, c, g, t\}$. Let $\Sigma^*$ denote the collection
of all possible strings made of letters from $\Sigma$, including the empty string $\epsilon$. We call any subset
$L \subset \Sigma^*$ a language over the alphabet $\Sigma$. Any string over $\Sigma$ is called a word. If we denote the given
tag by $w_0$, then in our case

$$L = \{\text{words which do not contain } w_0 \text{ as a factor}\}.$$

$F$ is called the fractal related to the language $L$.

3 Box dimension of fractals 

When we discuss the Box dimension, we can consider a more general case, i.e. the case of more
than one tag. We denote the set of tags by $B$, and assume that no element of $B$ is a
factor of any other element of $B$. We define

$$L_1 = \{\text{words which do not contain any element of } B \text{ as a factor}\}.$$

Now let $a_K$ be the number of all strings of length $K$ that belong to the language $L_1$. As the linear
size $\delta_K$ in the $K$-frame is $1/2^K$, the Box dimension of $F$ may be calculated as:

$$\dim_B(F) = \lim_{K\to\infty} \frac{\log a_K}{-\log \delta_K} = \lim_{K\to\infty} \frac{\log a_K^{1/K}}{\log 2}. \tag{1}$$
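
The counting behind Eq. (1) can be tried numerically. The sketch below is not the method of the paper: it simply counts the allowed $K$-strings by dynamic programming over the last few letters and prints the finite-$K$ ratio $\log a_K / (K \log 2)$. The names `count_allowed` and `box_dimension_estimates` are hypothetical, and the estimates only approach the limit slowly as $K$ grows.

```python
import math
from collections import defaultdict


def count_allowed(tags, k_max, alphabet="acgt"):
    """Return [a_0, ..., a_{k_max}]; a_K = number of K-strings with no tag as a factor."""
    keep = max(len(t) for t in tags) - 1  # how many trailing letters still matter
    states = {"": 1}                      # trailing letters -> number of allowed strings
    counts = [1]                          # a_0 = 1 (the empty string)
    for _ in range(k_max):
        nxt = defaultdict(int)
        for suffix, n in states.items():
            for letter in alphabet:
                word = suffix + letter
                if any(word.endswith(t) for t in tags):
                    continue              # a tag has just been completed: forbidden
                nxt[word[-keep:] if keep else ""] += n
        states = nxt
        counts.append(sum(states.values()))
    return counts


def box_dimension_estimates(tags, k_max):
    """Finite-K values of log(a_K) / (K log 2), the quantity under the limit in Eq. (1)."""
    a = count_allowed(tags, k_max)
    return [math.log(a[k]) / (k * math.log(2)) for k in range(1, k_max + 1)]
```

Calling `box_dimension_estimates(["ctag"], 30)` gives a decreasing sequence of estimates; for the single tag `g` every estimate equals $\log 3 / \log 2$ up to rounding.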

Now we define the generating function of $a_K$ as

$$f(s) = \sum_{K=0}^{\infty} a_K s^K,$$

where $s$ is a complex variable.

First, $L_1$ is a dynamic language; from Theorem 2.5.2 of the cited reference, we have that

$$\lim_{K\to\infty} a_K^{1/K} \text{ exists; we denote it by } l. \tag{2}$$

From (1), we have

$$\dim_B(F) = \frac{\log l}{\log 2}. \tag{3}$$

For any word $w = w_1 w_2 \cdots w_n$, $w_i \in \Sigma$ for $i = 1, \ldots, n$, we denote

$$Head(w) = \{w_1,\ w_1 w_2,\ w_1 w_2 w_3,\ \ldots,\ w_1 w_2 \cdots w_{n-1}\},$$

$$Tail(w) = \{w_n,\ w_{n-1} w_n,\ w_{n-2} w_{n-1} w_n,\ \ldots,\ w_2 w_3 \cdots w_n\}.$$

For two given words $u$ and $v$, we denote $overlap(u, v) = Tail(u) \cap Head(v)$. If $x \in Head(v)$, then
we can write $v = x x'$. We denote $x' = v/x$ and define

$$u : v = \sum_{x \in overlap(u, v)} s^{|v/x|},$$

where $|v/x|$ is the length of the word $v/x$. From the Goulden-Jackson cluster method, we know that
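
The sets $Head$, $Tail$ and $overlap$ translate directly into code. The following is a minimal sketch; it assumes the standard Goulden-Jackson convention in which each overlap $x$ contributes a term $s^{|v/x|}$ to $u : v$, and the helper names `correlation_exponents` and `correlation_value` are hypothetical.

```python
def head(w):
    """Proper non-empty prefixes of w."""
    return {w[:i] for i in range(1, len(w))}


def tail(w):
    """Proper non-empty suffixes of w."""
    return {w[i:] for i in range(1, len(w))}


def overlap(u, v):
    """Words that are both a suffix of u and a prefix of v."""
    return tail(u) & head(v)


def correlation_exponents(u, v):
    """Exponents |v/x| = |v| - |x| of s in u:v, one per x in overlap(u, v)."""
    return sorted(len(v) - len(x) for x in overlap(u, v))


def correlation_value(u, v, s):
    """Numerical value of u:v at a given s."""
    return sum(s ** e for e in correlation_exponents(u, v))
```

For example, `overlap("ctag", "ctag")` is empty, so `ctag : ctag` vanishes, while `correlation_exponents("aaa", "aaa")` returns `[1, 2]`, i.e. `aaa : aaa` is $s + s^2$.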

$$f(s) = \frac{1}{1 - 4s - \mathrm{weight}(C)},$$

where $\mathrm{weight}(C) = \sum_{v \in B} \mathrm{weight}(C[v])$ and the $\mathrm{weight}(C[v])$ ($v \in B$) are solutions of the linear
equations:

$$\mathrm{weight}(C[v]) = -s^{|v|} - (v : v)\,\mathrm{weight}(C[v]) - \sum_{u \in B,\ u \neq v} (u : v)\,\mathrm{weight}(C[u]).$$

It is easy to see that $f(s)$ is a rational function. Its maximal disc of analyticity centred at the origin has radius
$|s_0|$, where $s_0$ is the zero of minimal modulus of $1/f(s)$. On the other hand, according to the
Cauchy criterion of convergence, $1/l$ is the radius of convergence of the series expansion of $f(s)$.
Hence $|s_0| = 1/l$. From (3), we obtain the following result.



Theorem 3.1 The Box dimension of $F$ is

$$\dim_B(F) = -\frac{\log |s_0|}{\log 2},$$

where $s_0$ is the zero of minimal modulus of $1/f(s)$ and $f(s)$ is the generating function of the language $L_1$.

In particular, the case of a single tag ($B$ contains only one word) is easily treated and some
of the results are shown in Table 1.

Table 1: Generating function and dimension for some single tags.
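
For a single tag $v$ of length $n$ the linear system has one unknown, $\mathrm{weight}(C[v]) = -s^{n}/(1 + (v : v))$, so $1/f(s) = 1 - 4s + s^{n}/(1 + (v : v))$. The sketch below evaluates this expression and locates its smallest positive zero. It relies on the assumption that, because the $a_K$ are non-negative, the singularity of $f$ nearest the origin lies on the positive real axis (Pringsheim's theorem), and it uses a plain scan followed by bisection rather than any method from the paper. All function names are hypothetical.

```python
import math


def self_overlap_exponents(tag):
    """Exponents |tag| - |x| for each proper prefix x of tag that is also a suffix."""
    n = len(tag)
    return [n - i for i in range(1, n) if tag[:i] == tag[n - i:]]


def inverse_generating_function(tag, s):
    """1/f(s) = 1 - 4s + s^n / (1 + (tag:tag)) for a single tag of length n."""
    corr = sum(s ** e for e in self_overlap_exponents(tag))
    return 1.0 - 4.0 * s + s ** len(tag) / (1.0 + corr)


def smallest_positive_zero(tag, step=1e-4, tol=1e-14):
    """Scan upwards from 0 for the first sign change of 1/f, then bisect."""
    lo, hi = 0.0, step
    while inverse_generating_function(tag, hi) > 0.0:
        lo, hi = hi, hi + step
        if lo > 0.34:  # the zero is expected in (1/4, 1/3]
            raise ValueError("no zero found in the scanned range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if inverse_generating_function(tag, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def box_dimension(tag):
    """Box dimension -log|s_0| / log 2 of Theorem 3.1 for a single tag."""
    return -math.log(smallest_positive_zero(tag)) / math.log(2.0)
```

As a sanity check, `box_dimension("g")` reproduces $\log 3 / \log 2$, since in that case $1/f(s) = 1 - 3s$.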



4 Hausdorff dimension of fractals 

We obtained the Box dimension of $F$ in the previous section. One will naturally ask whether
the Hausdorff dimension of $F$ equals its Box dimension. In this section we discuss the
Hausdorff dimension of $F$, and only for the case in which $B$ contains a single word $w_0$. From
the $K$-frames ($K = |w_0|, |w_0| + 1, \ldots$), we can find:

Proposition 4.1

$$\frac{\log 3}{\log 2} \le \dim_H(F) \le \dim_B(F) \le \frac{\log\left(4^{|w_0|} - 1\right)}{|w_0| \log 2} < 2.$$


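Now we denote $\alpha = -\frac{\log |s_0|}{\log 2}$ and $\alpha_K = \frac{\log a_K}{K \log 2}$.

For any word $w = w_1 w_2 \cdots w_K$, we denote by $F_{w_1 w_2 \cdots w_K}$ the corresponding closed square in the $K$-frame
and denote

$$F_K = \bigcup_{w = w_1 w_2 \cdots w_K \in L} F_{w_1 w_2 \cdots w_K},$$

then $F = \lim_{K \to \infty} F_K$.

We first prove $\dim_H(F) = \dim_B(F)$ under a condition, using an elementary method.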

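Lemma 4.1 Suppose $E \subset \mathbb{R}^2$ with $|E| < 1/2$, and let

$$B_1 = \{\, w = w_1 w_2 \cdots w_K \in L : |F_{w_1 w_2 \cdots w_K}| < |E| \le |F_{w_1 w_2 \cdots w_{K-1}}| \text{ and } E \cap F_{w_1 w_2 \cdots w_K} \neq \emptyset \,\},$$

then $\#B_1 \le 2\pi$.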






Proof. Note that for each $w = w_1 w_2 \cdots w_K \in B_1$ we have $|E| \le 2 |F_{w_1 w_2 \cdots w_K}|$. The interiors of the
$F_{w_1 w_2 \cdots w_K}$ with $w = w_1 w_2 \cdots w_K \in B_1$ are non-overlapping and all lie in a disc with radius $2|E|$,
and all $F_{w_1 w_2 \cdots w_K}$ are squares, hence

{2\E\f^ > [l=\F^^^^,„^^\f#B^ > ^{2\E\f#B,, 

hence $\#B_1 \le 2\pi$.
$\Box$

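For any $w = w_1 \cdots w_{|w|}$ and $r \in \Sigma$, we denote $w * r = w_1 \cdots w_{|w|} r$ and define

$$\nu_w = \nu_{w_1} \nu_{w_2} \cdots \nu_{w_{|w|}},$$

where

$$\nu_{w_j} = \begin{cases} 2^{\alpha}/4, & \text{if } \#\{r \in \Sigma : w_1 w_2 \cdots w_{j-1} r \in L\} = 4, \\ 2^{\alpha}/3, & \text{if } \#\{r \in \Sigma : w_1 w_2 \cdots w_{j-1} r \in L\} = 3. \end{cases}$$

We assume

(C1) $\nu_w = \nu_{w_1} \nu_{w_2} \cdots \nu_{w_{|w|}} \le M$ (a constant) for any $w \in L$.

Now we have: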



Theorem 4.1 Under condition (C1), we have

$$\dim_H(F) = \dim_B(F) = \alpha \quad \text{and} \quad 0 < \mathcal{H}^{\alpha}(F) < \infty,$$

where $\mathcal{H}^{\alpha}(F)$ is the Hausdorff measure of $F$.

Proof. We first prove that

$$\mathcal{H}^{\alpha}(F) < \infty. \tag{4}$$

Since $\alpha_K \to \alpha$ as $K \to \infty$, for any small $\varepsilon > 0$ there exists an integer $N > 0$ such that for any
$K > N$ we have $\alpha > \alpha_K - \varepsilon$. Hence



\'t^wiW2...wj(\ — (^K\ 

w=wiW2...wkGL 



E l^-i-2...»Kr = aK{^f''<aK{-)''^''--'^ 



(1)-^^ < (^)-(^+^)^ < oo. 



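Hence $\mathcal{H}^{\alpha}(F) < \infty$.

Now we want to prove $\mathcal{H}^{\alpha}(F) > 0$. We denote

$$\Sigma^{\infty} = \{\tau = \tau_1 \tau_2 \cdots : |\tau| = \infty \text{ and } \tau_1 \cdots \tau_K \in L \text{ for } K = 1, 2, \ldots\}.$$

For any $\tau = \tau_1 \tau_2 \cdots \in \Sigma^{\infty}$, we denote $\tau|_K = \tau_1 \tau_2 \cdots \tau_K$, and define a probability measure $\tilde{\mu}$ on
$\Sigma^{\infty}$ by

$$\tilde{\mu}([w]) = \left(\tfrac{1}{2}\right)^{|w| \alpha} \nu_w, \quad \text{where } [w] = \{\tau \in \Sigma^{\infty} : \tau|_{|w|} = w\}.$$

We can see

$$\sum_{r \in \Sigma,\ w * r \in L} \tilde{\mu}([w * r]) = \sum_{r \in \Sigma,\ w * r \in L} \left(\tfrac{1}{2}\right)^{(|w| + 1) \alpha} \nu_{w * r} = \tilde{\mu}([w]).$$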

There exists a natural continuous map $f$ from $\Sigma^{\infty}$ to $F$. Now we transfer $\tilde{\mu}$ to a probability
measure on $F$: let $\mu = \tilde{\mu} \circ f^{-1}$. We will show that there is some constant $M_1 > 0$ such that if $E$
is a Borel subset of $\mathbb{R}^2$ with $|E| < 1/2$, then $\mu(E) \le M_1 |E|^{\alpha}$. Of course, this inequality implies
$\mathcal{H}^{\alpha}(F) \ge 1/M_1 > 0$.
Set

$$B_1 = \{\, w = w_1 w_2 \cdots w_K \in L : |F_{w_1 w_2 \cdots w_K}| < |E| \le |F_{w_1 w_2 \cdots w_{K-1}}| \text{ and } E \cap F_{w_1 w_2 \cdots w_K} \neq \emptyset \,\}.$$



Then 






D 



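Theorem 4.2 If the length of the tag satisfies $|w_0| \ge 3$ and for any $w \in L$, $\nu_w$ has the form

$$\nu_w = \left(\tfrac{2^{\alpha}}{3}\right) \left(\tfrac{2^{\alpha}}{4}\right)^{l_1} \left(\tfrac{2^{\alpha}}{3}\right) \left(\tfrac{2^{\alpha}}{4}\right)^{l_2} \left(\tfrac{2^{\alpha}}{3}\right) \cdots$$

or

$$\nu_w = \left(\tfrac{2^{\alpha}}{4}\right)^{l_1} \left(\tfrac{2^{\alpha}}{3}\right) \left(\tfrac{2^{\alpha}}{4}\right)^{l_2} \left(\tfrac{2^{\alpha}}{3}\right) \left(\tfrac{2^{\alpha}}{4}\right)^{l_3} \cdots$$

where $l_1$, $l_2$ and $l_3$ are positive integers, then $\dim_H(F) = \dim_B(F) = \alpha$ and $0 < \mathcal{H}^{\alpha}(F) < \infty$.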

Proof. Since $|w_0| \ge 3$, we have $\alpha > \frac{\log 12}{2 \log 2}$, hence



Form the other condition, we know that there exists Mi = niax{(^),l} such that u^ < Mi for 



any $w \in L$. Then from Theorem 4.1, we obtain the result of this theorem.

$\Box$

Examples: for $w_0 = ctg$ or $w_0 = ctag$, the result $\dim_H(F) = \dim_B(F)$ holds.

Even without condition (C1), we can still obtain $\dim_H(F) = \dim_B(F)$, as shown in the following.

We define $B_2 = \{u \in \Sigma^* : |u| = |w_0|,\ u \neq w_0\}$. The set $B_2$ contains $N_1 = 4^{|w_0|} - 1$
elements, hence we can write $B_2 = \{u_1, u_2, \ldots, u_{N_1}\}$. Now we can define an $N_1 \times N_1$ matrix $A$ by

$$A = [t_{i,j}]_{i,j \le N_1},$$

where $t_{i,j} = (1/2)^{\beta}$ if $u_i = r_1 x$ and $u_j = x r_2$ with $|x| = |w_0| - 1$, $r_1, r_2 \in \Sigma$, and $t_{i,j} = 0$ otherwise,
and where $\beta$ satisfies $\Phi(\beta) = 1$ with $\Phi(\beta)$ being the largest nonnegative eigenvalue of $A$. Then
from the results of the cited reference of Mauldin and Williams, we have

Theorem 4.3 If $B = \{w_0\}$, then

$$\dim_H(F) = \dim_B(F) = \beta \quad \text{and} \quad 0 < \mathcal{H}^{\beta}(F) < \infty.$$

From Theorem 3.1 and Theorem 4.3, we have

Corollary 4.1 If $B = \{w_0\}$, then

$$\beta = \dim_H(F) = \dim_B(F) = \alpha.$$
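
The matrix route to $\beta$ can be checked numerically for a short tag. Since every nonzero entry of $A$ equals $(1/2)^{\beta}$, the condition $\Phi(\beta) = 1$ reduces to $\beta = \log \rho / \log 2$, where $\rho$ is the largest eigenvalue of the underlying 0/1 matrix. The sketch below estimates $\rho$ by power iteration, which is an assumption of this illustration and not a step taken from the paper; the names `allowed_blocks`, `successors`, `spectral_radius` and `beta` are hypothetical.

```python
import math
from itertools import product


def allowed_blocks(tag, alphabet="acgt"):
    """B_2: all words of length |tag| except the tag itself."""
    words = ("".join(w) for w in product(alphabet, repeat=len(tag)))
    return [w for w in words if w != tag]


def successors(tag, alphabet="acgt"):
    """For each block u_i = r1 x, the indices j of blocks u_j = x r2."""
    blocks = allowed_blocks(tag, alphabet)
    index = {u: i for i, u in enumerate(blocks)}
    return [[index[u[1:] + r] for r in alphabet if u[1:] + r in index] for u in blocks]


def spectral_radius(succ, iterations=500):
    """Power iteration on the 0/1 matrix stored as successor lists."""
    vec = [1.0 / len(succ)] * len(succ)   # positive start vector with sum 1
    rho = 0.0
    for _ in range(iterations):
        new = [sum(vec[j] for j in row) for row in succ]
        rho = sum(new)                    # growth factor, since sum(vec) == 1
        vec = [x / rho for x in new]
    return rho


def beta(tag):
    """Solve rho * (1/2)**beta = 1, i.e. beta = log(rho) / log(2)."""
    return math.log(spectral_radius(successors(tag))) / math.log(2.0)
```

For the tag `ctg`, `beta("ctg")` can be compared with $-\log s_0 / \log 2$, where $s_0$ is the smallest positive zero of $1 - 4s + s^3$; the two numbers agree to within the accuracy of the iteration, in line with Corollary 4.1.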



Remark: When $B$ contains more than one word, we can also construct a matrix $A$ similarly;
then from the results of the same reference, we can obtain the conclusions of Theorem 4.3 and Corollary
4.1 for this case as well. From Corollary 4.1, we have two methods to calculate the Hausdorff and Box
dimensions of $F$, i.e. calculating $\alpha$ and $\beta$ respectively.